The motivation of this study is to develop a compact offline recognition
model for Khmer handwritten text that can be applied successfully under
limited access to high-performance computational hardware. Such a model aims
to ease the ad-hoc digitization of vast handwritten archives in many spheres.
Data collected for previous experiments were used in this work. One-against-all
classification was carried out with state-of-the-art techniques. A
compact deep learning model (2+1CNN), with two convolutional layers and
one fully connected layer, was proposed. The recognition rate was
within 93-98%. The compact model performed on par with the state-of-the-art
models. It was found that the computational capacity requirements
usually associated with deep learning can be alleviated, thereby allowing
applications under limited computational power.

1. INTRODUCTION

Khmer is the official language of Cambodia, spoken by about 16 million people. It has an
alphasyllabary (abugida) writing structure: words are composed of syllables, most of which consist of a
radical for a consonant and additional marks for vowels. The modern Khmer alphabet consists of 33 consonants.

There is a great demand for a recognition system reflecting the specifics of Khmer writing due to the
constant accumulation of documents in such spheres as government, healthcare, finance, and education. Until
the early 2000s, most records in the government and private sectors in Cambodia were held on handwritten
documents and hand-filled forms. One has to manually browse through the entire mass of paper to reach any
of these records. The bulk of such tasks is extremely complex to carry out on a daily basis, even with the help
of a systematic archiving system. Having an effective deep learning [1] application for digitizing handwritten
text is particularly important for promoting the development of public and private services. Such an
application also needs to be inexpensive and applicable in developing economies.

Compared with other common writing systems, there is very little research on
Khmer text recognition. Most of the efforts have been made only within the past decade [2]-[8]. Sok and
Taing [3] and Srun and Vishnyakov [4], [5], [7] studied recognition of printed Khmer text. Ye et al. [8]
developed an online recognition method for printed text in the Khmer, Bangla, and Myanmar alphabets. The
amount of work in the field, as well as the nature of the data collected for relevant experiments, describes the
current state of the art for Khmer handwritten text recognition (HTR). Most of the data used in past
experiments were printed (machine-derived) text, which greatly impedes the development of an accurate
application.

As an extension of previous experiments [9], [10], the current work implemented CNNs for Khmer
HTR. A novel, compact model, 2+1CNN, was proposed to be used alongside models from the literature
(LeNet-5 [1], AlexNet [11], visual geometry group 16 (VGG16), VGG19 [12], ResNet [13]). 2+1CNN is
designed for binary classification, while the existing models were adapted to it because of the
one-against-all tactic used throughout the work.

To increase overall performance, an independent network was trained and evaluated for each class.
While training each network, one particular class was taken as "positive" and all others as "negative". That
is, given a set of classes $C = \{c_1, c_2, \ldots, c_k\}$, the samples of class $c_i$ were isolated and all
samples of the other classes $c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_k$ were considered as "not $c_i$"
(or $c_i'$). Training a convolutional neural network (CNN) with this setting yielded a classifier model
$F_i(\cdot)$. The output of the training process was the combination of all trained classifiers:

$$F(\cdot) = \{F_i(\cdot) \mid i = 1, \ldots, k\} \qquad (1)$$


Intuitively, the final model was designed to iterate the question "Are you of class $c_i$?" instead of
asking directly "What class are you?" This work aims to design a compact model for the Khmer HTR system.
The lack of appropriate datasets contributes to the difficulty of this task. Only the datasets collected in
preliminary experiments [9], [10] were used.
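
The one-against-all setup described above can be illustrated with a short sketch. It is not the authors' implementation: the helper names (`one_vs_all_labels`, `train_ensemble`, `answers`), the toy sample identifiers, and the memorising stand-in for CNN training are all assumptions made for illustration only.

```python
def one_vs_all_labels(labels, positive):
    """Relabel a multi-class label list as 1 (positive class) / 0 (all others)."""
    return [1 if label == positive else 0 for label in labels]


def train_ensemble(samples, labels, train_binary):
    """Train one binary classifier per class and return them keyed by class."""
    return {
        cls: train_binary(samples, one_vs_all_labels(labels, cls))
        for cls in sorted(set(labels))
    }


def answers(ensemble, sample):
    """Ask every classifier "are you of my class?" and collect the answers."""
    return {cls: clf(sample) for cls, clf in ensemble.items()}


# Toy stand-in for CNN training (hypothetical): it memorises the positive samples.
def train_memoriser(samples, binary_labels):
    positives = {s for s, y in zip(samples, binary_labels) if y == 1}
    return lambda sample: sample in positives


samples = ["a1", "a2", "b1", "c1"]      # hypothetical sample identifiers
labels = ["ka", "ka", "kha", "ko"]      # hypothetical class names
ensemble = train_ensemble(samples, labels, train_memoriser)
```

In the actual experiments, each entry of the ensemble would be a trained CNN rather than a lookup, but the relabeling and the per-class structure stay the same.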


2. RELATED WORK 
2.1. Recognition of Khmer handwriting 

Meng and Morariu [2] described how to combine a feedforward artificial neural network (ANN) with
a self-organizing map (SOM) to design a recognition system for printed Khmer characters. Sok and Taing [3]
described their experiment with a support vector machine (SVM) on printed Khmer characters. Font
size-based accuracy and CPU load were presented as efficiency assessments. The authors also listed the scarce
work done on Khmer optical character recognition (OCR) to emphasize the lack of research for the Khmer
language. Backpropagation was used by Srun [4] to train a classifier to recognize Khmer characters. For the
experiments, Srun sampled printed text. Preprocessing consisted of resizing images to standard dimensions.
Thumwarin et al. [6] implemented finite impulse response (FIR) to extract features from handwritten Khmer
characters and sent the results to a Euclidean-based classifier. The work relies on temporal information,
which is impossible to collect from a scanned image of a manuscript. Another problem is that the method
requires extra hardware for collecting temporal information. Another work by Srun and Vishnyakov [7]
included the implementation of classifiers in Tesseract and further improvement of the recognition quality
of scanned characters. The earliest mention of Khmer HTR in a computerized setting dates to 2008, in the
work by Ye et al. [8], which proposed a recognition system for scripts such as Myanmar, Khmer, and Bangla.
Research data were collected by drawing characters with a mouse, which is also a drawback of the work.
Unlike many previous attempts, the data used in the current work reflect the nature of common handwriting,
which makes the resultant models more realistic. Khmer datasets acquired in previous attempts are compared
in Table 1.


Table 1. Data sets acquired for Khmer HTR

Literature | Dataset | Data and size
Sok and Taing [3] | Printed and scanned text | Khmer characters, 3000
Ye et al. [8] | Collected by mouse, stylus pen | Khmer characters, 135; Myanmar characters, 107
Thumwarin et al. [6] | Scanned text | Khmer letters and digits, 6750
Kruy and Kameyama [14] | Printed and scanned text | Khmer words, 1104
Meng and Morariu [2] | Printed and scanned text | Khmer characters, 215
Kheang et al. [15] | Printed and scanned text | Khmer words, 110713
Srun [4] | Printed and scanned text | Khmer characters, 33


2.2. Convolutional neural networks 

A convolutional neural network (CNN, or ConvNet) is a special kind of deep, feed-forward artificial
neural network. In an image processing application, CNNs learn directly from images. The key concepts
in the description of a CNN are local receptive fields (LRF), shared weights and biases, activation,
and pooling. CNNs also differ from each other in the method and objective of training, e.g., prediction, object
discovery, or segmentation.

According to LeCun [1], [16], a CNN is a variation of the multilayer perceptron that requires minimal
preprocessing. The connectivity pattern between neurons in a CNN is inspired by the biological
processes of the animal visual cortex, where each cortical neuron responds to signals from only a restricted
area of the visual field (receptive field). Matsugu et al. [17] described that the receptive fields connected to
different neurons partially overlap. This leads to the entire visual field being covered and, therefore, to
smooth vision.

Figure 1 shows an example of a three-dimensional neuron arrangement in a convolutional neural
network. In this example, the input is a three-channel image, where each pixel has a separate value for the
red, green, and blue components. The image is transformed into an output in the form of a 3D matrix of
neurons. The data used in this study were preprocessed into grayscale images.




Figure 1. 3-D neuron arrangements in a CNN [18] 


The convolution operation is performed on the input data. This step models the response of an
individual biological neuron to visual input. The activation step applies a transformation to the output of each
neuron by using an activation function. The rectified linear unit (ReLU) is an example of a commonly used
activation function. It takes the output of a neuron and passes it through unchanged if it is positive. If the
output is negative, the function maps it to zero, i.e., f(x) = max(0, x).
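
As a minimal sketch of the activation step, ReLU in its standard definition f(x) = max(0, x) can be written in a few lines. The function name `relu` and the sample values are illustrative assumptions, not part of the original experiments.

```python
def relu(x):
    """Rectified linear unit: keep positive values, map negative values to zero."""
    return max(0.0, x)


# Hypothetical neuron outputs before activation.
outputs = [-2.5, 0.0, 1.5]
activated = [relu(v) for v in outputs]
```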

The output of the activation step can be further transformed by applying a pooling step. Pooling
reduces the dimensionality of the feature map by condensing the output of small regions of neurons into a
single output. This helps to simplify the subsequent layers and reduces the number of parameters that the
model needs to learn. CNN layers are configured by these three concepts. A CNN can have tens or hundreds
of hidden layers, each of which learns to detect different features of an image. In such feature maps, every
hidden layer increases the complexity of the learned image features. For example, the first hidden layer
learns how to detect edges, and the last layer learns how to detect more complex shapes.

In a CNN, inputs from a small local receptive field (LRF) are connected to one neuron of the hidden
layer. The LRF is translated across an image to create a feature map from the input layer to be used in the
hidden layers. Convolutions are used to implement this process efficiently [19]. A convolution operation is
applied to the input of each layer. The convolution mimics the reaction of neurons to visual input. The CNN
architecture also includes pooling layers, which group the outputs of one layer into a single neuron in the
next layer [11], [20]. The cluster of neurons is designed in the form of square batches of any size n x n,
where n = 2, 3, 4, ....
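
The local receptive field and pooling steps described above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names `conv2d_valid` and `max_pool`, the 4x4 toy image, and the 2x2 kernel are assumptions, and the "convolution" is computed without flipping the kernel (cross-correlation), as is common in CNN libraries.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel (the local receptive field) over the image, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(fmap, n=2, stride=2):
    """Condense each n x n region of the feature map into its maximum value."""
    out_h = (len(fmap) - n) // stride + 1
    out_w = (len(fmap[0]) - n) // stride + 1
    return [
        [
            max(fmap[r * stride + i][c * stride + j]
                for i in range(n) for j in range(n))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x4 grayscale image and a 2x2 diagonal-difference kernel (both hypothetical).
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d_valid(image, kernel)   # 3x3 feature map
pooled = max_pool(image)                    # 2x2 map after 2x2 max pooling
```

The same weights in `kernel` are reused at every position, which is what the shared weights and biases of a CNN refer to.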

In some cases, pooling batches need to be moved beyond the boundaries of a sample image, which
may cause ambiguity in the training process as well as computational and programmatic complexity.
Extending the image by several rows and columns of pixels to match the size of the pooling batches (padding)
helps to overcome this problem. The values used for the extra pixels may be chosen in different ways: the
average over the whole spectrum of pixel values (average padding) or zeros (zero padding). Denoting the
filter size as F, the input size as W, the resulting image size as R, the padding size as P, and the stride size
as S, the size of the sample after each pooling layer is given by the following expression, which can also be
deduced for two dimensions:


$$R = \begin{cases} \dfrac{W - F + 2P}{S}, & \text{[condition illegible in source]} \\[4pt] \dfrac{W - F + 2P}{S} + 1, & \text{otherwise} \end{cases} \qquad (2)$$


3. RESEARCH METHOD

The current experiments were based on the same data set and most of the preprocessing steps of our
previous work [9], [10]. Later, the potential to substantially increase the recognition rate of neural networks
was explored [21]. Figure 2 shows the development of the Khmer HTR framework. Data collection and
preliminary experiments were completed in our previous work [9], [10]. In the preliminary experiments, the
number of features was reduced by 90% using three independent methods: correlation-based feature
selection (CORR), two-dimensional Fourier transform (FT2D), and Gabor filters (GF). The result of each
method was classified with an artificial neural network (ANN). The original data, without feature space
transformation, were classified for comparison of performance. Gabor filters yielded the highest improvement
in recognition. This suggested that filters may play an important role in feature extraction. The current study
is based on convolutional models, which rely on a wider variety of filters. In the course of the current work,
the models LeNet-5, AlexNet, VGG16, VGG19, and ResNet50 were modified for binary classification.


[Figure 2 flowchart. Preliminary work: data collection; cropping; feature space transformation (Raw, Corr,
FT2D, Gabor), each followed by an ANN. Current work: deep learning (ResNet among the models).
Evaluation: cross-validation.]


Figure 2. Research framework 


3.1. One-against-all tactic 

Khmer samples of one consonant were taken as the positive class and those of the remaining
consonants as the negative class. Such practice is called two-way classification because there are only two
classes to recognize: "positive" and "negative". It was adopted at all stages of the work. The performances
of all 33 classifiers (one per consonant) were averaged to obtain the performance of each method. The final
classification model for each method is the assembly of the classifiers as described in (1). This tactic was
adopted since it has proven to be highly effective in comparison to direct multi-class classification in many
other applications [22]. Each Khmer character was treated, based on its root radical (consonant), as a sample
of that consonant. Since 17 vowels were combined with each consonant, there were 17 samples in each class.


3.2. The proposed model 

This study introduces a convolutional neural network with a compact architecture: two convolutional
layers and one fully connected layer. The model is referred to as "2+1CNN" for brevity. The model is built
from the ground up and is initialized with random weights. 2+1CNN is based on the one-against-all tactic and
designed for binary classification.


3.3. Proposed model architecture 

2+1CNN was proposed as a compact model for Khmer HTR and is expected to ease the burden of
computational requirements while staying close to the guidelines of previously designed successful
architectures [1], [11]-[13], [23]. Local receptive fields of size 5x5 were used in the convolutional layers.
Max pooling over 2x2 patches was used after each convolutional layer. Convolutional and pooling
layers were kept as simple as possible to reduce the number of computations per filter. The input size was
kept the same as in previous research [11].

To prevent overfitting, 50% of the nodes in the fully connected layer are dropped out at random.
The rectified linear unit (ReLU) is used as the activation function because of the simplicity of its
differentiation and its behavior being close to that of other activation functions. The hyper-parameters used
in 2+1CNN are as follows:

- Input images are pre-processed and resized to 224x224.

- First convolutional layer with ReLU activation, 5x5 filters, stride size 1.

- First pooling layer with 2x2 filters, stride size 1.

- Second convolutional layer with ReLU activation, 5x5 filters, stride size 2.

- Second pooling layer with 2x2 filters, stride size 1.

- The dropout stage randomly erases 50% of the perceptrons to reduce overfitting.

- The fully connected layer is made of 463 perceptrons with the ReLU activation function. This number is
chosen as the average of the number of features and the number of samples:
(number of features + number of samples) / 2.

Table 2 illustrates the structure of 2+1CNN. The values R, W, P, and F were obtained per (2). Filter
sizes were chosen to minimize the number of computations required during model training. Figure 3
visualizes a sample as it passes through each layer of 2+1CNN. The represented layers are input,
convolution, pooling, convolution, and pooling. All other models used in this work (LeNet, AlexNet, VGG16,
VGG19, ResNet) were also modified so that the number of output classes was reduced to two. This
modification implements binary classification, as required by the adopted one-against-all tactic. While
2+1CNN was built from the ground up, transfer learning was used to retrain the state-of-the-art models on
Khmer samples. Due to the limited processing power available and the large amount of data, the training of
all classifiers was limited to 500 iterations.


Table 2. Structure of 2+1CNN

Layer | Input (WxW) | Padding (PxP) | Filter size (FxF) | Stride size (SxS) | Output (RxR)
Data | 224x224 | | | |
Conv-1, ReLU | 224x224 | 2x2 | 5x5 | 1x1 | 223x223
Pooling-1 | 223x223 | 0x0 | 1x1 | 2x2 | 111x111
Conv-2, ReLU | 111x111 | 2x2 | 5x5 | 2x2 | 55x55
Pooling-2 | 55x55 | 0x0 | 1x1 | 2x2 | 27x27
Dropout | 27x27 | - | - | - | 27/2
FC Layer | 365 | - | - | - | 1
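
The layer sizes in Table 2 can be traced with a small sketch. It assumes the form R = (W - F + 2P) / S without a "+1" term, truncated to an integer, because that choice reproduces the Output column of the table from the listed filter, padding, and stride values; the function name `output_size` and the layer list are illustrative, not taken from the original implementation.

```python
def output_size(w, f, p, s):
    """Size R along one dimension for input W, filter F, padding P, stride S.

    Assumption: R = (W - F + 2P) / S with integer truncation and no "+1" term,
    chosen because it matches the Output column of Table 2.
    """
    return (w - f + 2 * p) // s


# (layer name, filter F, padding P, stride S) as listed in Table 2
layers = [
    ("Conv-1", 5, 2, 1),
    ("Pooling-1", 1, 0, 2),
    ("Conv-2", 5, 2, 2),
    ("Pooling-2", 1, 0, 2),
]

size = 224  # input images are resized to 224x224
sizes = []
for name, f, p, s in layers:
    size = output_size(size, f, p, s)
    sizes.append((name, size))
    print(f"{name}: {size}x{size}")
```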


Figure 3. Visualization of a sample within 2+1CNN 


3.4. Performance evaluation 

The performance of each classifier was quantified by the recognition rate on the testing data set: the
ratio of the number of samples recognized correctly to the total number of samples. To ensure the robustness
of each model, four-fold cross-validation was applied. To measure the performance of an assembly of
classifiers (1), the average of their recognition rates was taken.
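
The evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`recognition_rate`, `fold_indices`, `mean`), the interleaved fold assignment, and the toy label lists are hypothetical and do not reproduce the original experiments.

```python
def recognition_rate(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def fold_indices(n_samples, n_folds=4):
    """Split sample indices 0..n-1 into n_folds interleaved test folds."""
    return [list(range(fold, n_samples, n_folds)) for fold in range(n_folds)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical binary outputs of two one-against-all classifiers on a test fold.
rates = [
    recognition_rate([1, 0, 0, 1], [1, 0, 0, 0]),  # 3 of 4 correct
    recognition_rate([0, 0, 1, 0], [0, 0, 1, 0]),  # 4 of 4 correct
]
ensemble_rate = mean(rates)   # performance of the assembly of classifiers
folds = fold_indices(10)      # four test folds over ten sample indices
```

Each fold serves once as the testing set while the remaining folds are used for training; the per-fold rates would then be averaged in the same way as the per-classifier rates.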


3.5. System specifications 
System used in experimentation: Windows 7 64-bit, 4 GB RAM, Intel Core i3-220 2.20
GHz CPU. The CNN architectures were implemented in Keras with a TensorFlow backend.


4. RESULTS AND DISCUSSION 

The DL models were applied to each of the existing data sets individually and compared by
recognition rate on each data set. The main finding of the study is that the compact model 2+1CNN is highly
effective. Its recognition rate was more than 94% on average, on par with the other models. This
supports the concept of an ad-hoc CNN-based recognition system that can be designed in a setting with low
computational capabilities. The implications are important for applications in growing economies,
such as Cambodia's, where developers and data engineers have limited access to high-performance technology.

Table 3 compares the hardware used in previous experiments to that of the current work. The overall
comparison of models is given in Table 4. Table 5 compares the current work against previous
attempts. It highlights the theoretical progress in the field of handwritten text recognition for abugida writing
systems, including Khmer. In previous attempts, data were collected either by scanning printed text or by
drawing with a computer mouse, which makes it difficult to represent common handwriting. The results of
the current HTR task were achieved on a hardware system of lesser specifications.


Table 3. System requirements for deep learning applications

Literature | System setting | Model | Data set, size | Result
Current | Intel Core i3, 2.20 GHz, RAM: 4 GB | 2+1CNN, LeNet, AlexNet, VGG16, VGG19, ResNet | Khmer chars, 3366 | Accuracy: 94.9%, 97.1%, 97.6%, 96.4%, 95.8%, 100%
[11] | GTX 580, GPU 3 GB | AlexNet | ImageNet, 1.4M | Error: 15.3%
[12] | 4xNVIDIA Titan Black GPU | VGG16, VGG19 | ImageNet, 1.4M | Error: 12%
[24] | 8xGPU | ResNet | [illegible in source], CIFAR- | Error: 3.57%, 6.97%
[25] | GeForce Titan X Pascal GPU | LeNet | IAM, 115k; RIMES, 12k | Error: 12.7%, 6.6%
[26] | [illegible in source] | ResNet | Bangla, 200k | Error: 5.5%
[27] | GTX TITAN X GPU | ResNet | ICDAR-2013, 462 | Accuracy: 97.03%
[28] | Intel i7-4600U, 16 GB RAM | ResNet | ICDAR-2011, 7166 | F-score: 90.18-96.88
[29] | GTX Titan X | ResNet, VGG16 | Georgian HWT, 200k | Accuracy: 95%, 89%
[30] | [illegible in source] | AlexNet | Iranshahr, 15k | Accuracy: 99.13%


Table 4. Overall summary of CNN-based models

Model | Average recognition rate (%) | Convolutional layers | Fully connected layers
2+1 CNN | 94.9 | 2 | 1
LeNet-5 | 97.2 | 2 | 3
AlexNet | 97.6 | 5 | 3
VGG16 | 96.4 | 13 | 3
VGG19 | 95.8 | 16 | 3
ResNet50 | 100 | 49 | 1


Table 5. Previous attempts to develop a classifier for abugida-type texts

Literature | Dataset | Data and size | Classifier | Accuracy
Sok and Taing [3] | Printed and scanned text | Khmer characters, 3000 | SVM | 98%
Ye et al. [8] | Mouse drawn, stylus | Khmer, 135; Myanmar, 107 | Stock methods | Writing speed
Thumwarin et al. [6] | Scanned text | Khmer symbols, 6750 | Distance-based | 98%
Kruy and Kameyama [14] | Printed and scanned text | Khmer words, 1104 | SIFT, distance-based | 98%
Meng and Morariu [2] | Printed and scanned text | Khmer characters, 215 | ANN | 65%
Kheang et al. [15] | Printed and scanned text | Khmer words, 110713 | WFST | ~73%
Srun [4] | Printed and scanned text | Khmer characters, 33 | ANN | 97%
Annanurov and Noor [9], [10] | Handwritten characters | Khmer syllables, 3366 | ANN, feature extraction | Higher performance
2+1 CNN | Handwritten characters | Khmer syllables, 3366 | 2+1 CNN | 94.9%


5. CONCLUSION 

This work aimed to develop a compact and effective model for offline recognition of Khmer
handwritten characters. In general, recognition rates were within 93-98%. The 2+1CNN model was built
from the ground up and achieved a recognition rate of over 94%, which is at the same level as other, more
sophisticated models. The results also help to close the research gap in the field since, at the time of the
experiments, Khmer HTR had not yet been approached with deep learning. The main contribution is the
compact Khmer HTR model (2+1CNN) with low computational requirements, which is based on open-source
software and does not require any proprietary packages. These aspects ease its implementation, thereby
allowing swift digitization of document corpora in rural and developing areas. The developed models may be
applied in a high-end OCR application targeted at the general public, or used as the back-end part of more
sophisticated applications aiming to digitize documents. Future work may include recognition based on
information about the layout of documents, forms, and tables.